An RNA-Seq workflow for one independent variable analysis
FASTQ Files Quality AssessmentFASTQ Files Quality Trimming!!! Coding style ~ 我有盡量檢查了,但是真的好多Q我怕我有沒有按照的地方,學姊可以幫我注意一下~ code, file suffix, value, parameter ‘file directory’ ‘operating system’ Normal software name
RNA-Seq(RNA sequencing) uses next-generation sequencing(NGS) to provide transcriptome profiling. It has changed our thought about how complex eukaryotic transcriptomes are(Mortazavi et al. 2008).(RNA-Seq Blog). This technique is widely used to look at alternative gene spliced transcripts, post-transcriptional modification, SNPs over time, differences in gene expression in different groups or treatment etc.(wiki). Among them, differential expressed gene(DEG) analysis is the most common downstream analysis.
RNASeqWorkflow provides an easier and R-based approach for running end-to-end two-group RNA-Seq analysis from quality assessment, trimming to downstream differential expression analysis, functional, pathway analysis. Some important features include automated workflow, comprehensive visulization reports, organized, extendable file structure and supporting background process in each step etc.
The following are main tools that are used in this package: ‘HISAT2’ for reads alignment(Kim et al. 2015); ‘StringTie’ for alignments assembly and transcripts quantification(Pertea et al. 2015); ‘Gffcompare’ for comparing merged GTF file with reference GTF file; ‘systemPipeR’ package for quality assessment(Backman et al. 2016); ‘ShortRead’ package for quality trimming(Morgan et al. 2009); ‘ballgown’ package(Fu et al. 2018), ‘TPM & Student’s t-test’, ‘DESeq2’ package(Love et al. 2014) and ‘edgeR’ package(Robinson et al. 2010;McCarthy et al. 2012) for finding potential DEGs; ‘clusterProfiler’ package(Yu et al. 2012) for Gene Ontology(GO) functional analysis and Kyoto Encyclopedia of Genes and Genomes(KEGG) pathway analysis.
The central concept for designing RNASeqWorkflow is that each step in RNA-Seq analysis is an R function. First users create a RNASeqWorkFlowParam S4 object for all variable checking. Then run the following functions:
RNASeqEnvironmentSet_CMD() or RNASeqEnvironmentSet(): Setup RNA-Seq environment.RNASeqQualityAssessment_CMD() or RNASeqQualityAssessment(): Optional. Run quality assessment step.RNASeqQualityTrimming_CMD() or RNASeqQualityTrimming(): Optional. Run quality trimming step.RNASeqReadProcess_CMD() or RNASeqReadProcess(): Run alignment, assembly, quantification, gffcompare and raw reads count generation etc.RNASeqDifferentialAnalysis_CMD() or RNASeqDifferentialAnalysis(): Run differential analysis step.RNASeqGoKegg_CMD() or RNASeqGoKegg(): Do GO and KEGG analysis.Functions with CMD suffix create an R script and run nohup R CMD BATCH script.R in background. Functions with no CMD suffix process in R shell. After running the above functions, the whole RNA-Seq analysis is done and generated files in each step will be stored in organized file directory. RNASeqWorkflow package makes two-group RNA-Seq analysis more efficient and easier for users.
The sample data that used in this vignette can be downloaded from RNASeqWorkflowData experiment package. The species is Saccharomyces cerevisiae, and reference FASTA and GTF files are downloaded from iGenomes Ensembl R64-1-1. Six FASTQ files are selected from SRP073391 SRA study(SRR3396381, SRR3396382, SRR3396384, SRR3396385, SRR3396386, SRR3396387)(Pang CN et al. 2017). To minimize the workflow process time, each raw FASTQ files are already aligned to BAM files first and reads that are in XV:0-180000 chromsome region are selected and created into new mini FASTQ files. This extracted mini data is just for demonstration purpose.
Necessary:
RNASeqWorkflow package. ‘Windows’ is not supported. (Because ‘StringTie’ and ‘HISAT2’ are not available for ‘Windows’)system2() through R shell.Recommended:
2to3: If the Python version in the system is 3, then this command will be used in the following analysis step. Generally, 2to3 is available if Python3 is available. (Python and 2to3 is used for creating raw reads count for DESeq2 and edgeR. If these commands are not available, DESeq2 and edgeR analysis will be skipped.)This is the pre-step of RNA-Seq workflow in this package. Before starting RNA-Seq analysis, you have to run the constructor function RNASeqWorkFlowParam() and create RNASeqWorkFlowParam S4 object first. This object, which stores checked parameters, will be used as input parameter in the following analysis function.
RNASeqWorkFlowParam Slots ExplanationThere are 11 slots in RNASeqWorkFlowParam:
os.type : The operating system type. Value is linux or osx. This package only support ‘Linux’ and ‘macOS’ (no ‘Windows’). If other operating system is detected, ERROR will be reported.python.variable : Python-related variable. Value is a list of whether Python is available and Python version (TRUE or FALSE, 2 or 3).python.2to3 : Availability of 2to3 command. Value is TRUE or FALSE.path.prefix : Path prefix of ‘gene_data/’, ‘RNASeq_bin/’, ‘RNASeq_results/’, ‘Rscript/’ and ‘Rscript_out/’ directories. It is recommended to create a new directory with out any file inside and all the following RNA-Seq results will be installed in it.
input.path.prefix : Path prefix of ‘input_files/’ directory. User have to prepare an ‘input_file/’ directory with the following rules:
genome.name.fa : reference genome in FASTA file formation.genome.name.gtf : gene annotation in GTF file formation.FASTQ files.
FASTQ files : ’sample.pattern_1.fastq.gz’ and ’sample.pattern_2.fastq.gz’. sample.pattern must be distinct for each sample.sample.pattern in FASTQ files in ‘raw_fastq.gz/’. Column names must be ids.case.group and control.group. Column name is parameter independent.variable.HT2 indices files for HISAT2 alignment tool.
HT2 indices files corresponding to target reference genome can be installed at HISAT2 official website. Providing HT2 files can accelerate the subsequent steps. It is highly advised to install HT2 files.HT2 index files are not provided, ‘input_files/indices/’ directory should be deleted. genome.name : Variable of genome name defined in this RNA-Seq workflow (ex. ‘genome.name.fa’, ‘genome.name.gtf’)sample.pattern : Regular expression of paired-end fastq.gz files under ‘input_files/raw_fastq.gz/’. IMPORTANT!! Expression shouldn’t have _[1,2].fastq.gz in end.independent.variable: Independent variable for the biological experiment design of two-group RNA-Seq workflow.case.group : Group name of the case group.control.group : Group name of the control group.indices.optional : logical value whether ‘input_files/indices/’ is exit. Value is TRUE or FALSE
RNASeqWorkFlowParam Constructor Checking StepsCreate a new directory for RNA-Seq analysis. It is highly recommended to create a new directory without any files inside. The parameter path.prefix of RNASeqWorkFlowParam() constructor should be the absolute path of this new directory. All the RNA-Seq related files that generated in the following steps will be stored inside of this directory.
Create valid ‘input_files/’ directory. You should create a file directory named ‘input_files/’ with neccessary files inside. It should follow the rules metioned above.
RNASeqWorkFlowParam S4 object. This constructor will check the validity of input parameters before creating S4 objects.
library(RNASeqWorkflow)
library(RNASeqWorkflowData)input_files.path <- system.file("extdata/", package = "RNASeqWorkflowData")
rnaseq_result.path <- "/tmp/yeast_example/"
dir.create(rnaseq_result.path)
list.files(input_files.path, recursive = TRUE)## [1] "input_files/Saccharomyces_cerevisiae_XV_Ensembl.fa"
## [2] "input_files/Saccharomyces_cerevisiae_XV_Ensembl.gtf"
## [3] "input_files/phenodata.csv"
## [4] "input_files/raw_fastq.gz/SRR3396381_XV_1.fastq.gz"
## [5] "input_files/raw_fastq.gz/SRR3396381_XV_2.fastq.gz"
## [6] "input_files/raw_fastq.gz/SRR3396382_XV_1.fastq.gz"
## [7] "input_files/raw_fastq.gz/SRR3396382_XV_2.fastq.gz"
## [8] "input_files/raw_fastq.gz/SRR3396384_XV_1.fastq.gz"
## [9] "input_files/raw_fastq.gz/SRR3396384_XV_2.fastq.gz"
## [10] "input_files/raw_fastq.gz/SRR3396385_XV_1.fastq.gz"
## [11] "input_files/raw_fastq.gz/SRR3396385_XV_2.fastq.gz"
## [12] "input_files/raw_fastq.gz/SRR3396386_XV_1.fastq.gz"
## [13] "input_files/raw_fastq.gz/SRR3396386_XV_2.fastq.gz"
## [14] "input_files/raw_fastq.gz/SRR3396387_XV_1.fastq.gz"
## [15] "input_files/raw_fastq.gz/SRR3396387_XV_2.fastq.gz"
Check the files in ‘inputfiles/’ directory.
exp <- RNASeqWorkflowParam(path.prefix = rnaseq_result.path,
input.path.prefix = input_files.path,
genome.name = "Saccharomyces_cerevisiae_XV_Ensembl",
sample.pattern = "SRR[0-9]*_XV",
independent.variable = "state",
case.group = "60mins_ID20_amphotericin_B",
control.group = "60mins_ID20_control")In this example, RNASeqWorkFlowParam S4 object is store in exp for subsequent RNA-Seq analysis steps. Any ERROR occured in checking steps will terminate the program.
This is the first necessary step of RNA-Seq workflow in this package. To set up the environment, run RNASeqEnvironmentSet_CMD() to execute process in background or run RNASeqEnvironmentSet() to execute process in R shell.
path.prefix directory. Here are the usage of these five main directories:
XXX_CMD() function, corresponding R script(XXX.R) for certain step will be created in the directory.XXX.Rout) will be stored in this directory.path.prefix directory.The operating system of your workstation will be detected. If the operating system is other than Linux and macOS, ERROR will be reported. You can choose whether you want to install programs automatically or not. Four programs, HISAT2, SAMtools, StringTie and Gffcompare, are necessary for the following analysis workflow. In RNASeqEnvironmentSet_CMD() and RNASeqEnvironmentSet() functions, there are four parameters, install.hisat2, install.stringtie, install.gffcompare, and install.samtools, that you can set to determin whether particular softwares are going to be installed automatically or not. Defualt values are TRUE and these four programs will be installed by default. If you want to skip the installation process of particular program, you can set the parameter value to FALSE. But please make sure that the program you skipped in installation process is available by running system2() through R shell. Availability of these commands will be checked at the end of this environment setup function. Any unavailability of the commands will cause fail in ‘Environment Setup’ step.
If the parameter values are TRUE, these four tools will be installed: * HISAT2 * Based on your operating system, hisat2-2.1.0-Linux_x86_64.zip or hisat2-2.1.0-OSX_x86_64.zip zipped file will be installed. * Installed file will be unzipped and all binary files will be copied under ‘RNASeq_bin/’ * SAMtools * samtools-1.8.tar.bz2 source files will be installed. * Installed files will be unzipped and compiled. If there is any ERROR occurred during compilation, program will be stopped. You should check whether the development environment is OK for SAMtools before rerunning RNASeqEnvironmentSet_CMD or RNASeqEnvironmentSet() to set up the environment again. * Maked SAMtools binary will be copied under ‘RNASeq_bin/’. * StringTie * Based on your operating system, stringtie-1.3.4d.Linux_x86_64.tar.gz or stringtie-1.3.4d.Linux_x86_64 zipped file will be installed. * Installed file will be unzipped and all binary files will be copied under ‘RNASeq_bin/’. * Gffcompare * Based on your operating system, gffcompare-0.10.4.Linux_x86_64.tar.gz or gffcompare-0.10.4.Linux_x86_64.tar.gz zipped file will be installed. * Installed file will be unzipped and all binary files will be copied under ‘RNASeq_bin/’.
‘RNASeq_bin/’ will be added to R PATH so that these binaries can be found in R environment. In the last step of environment setting, hisat2 --version,stringtie --version,gffcompare --version,samtools --version commands will be checked in order to make sure the environment is correctly constructed.
Run RNASeqEnvironmentSet_CMD() or RNASeqEnvironmentSet(). 1. Run in background Result will be reported in ‘Rscript_out/Environment_Set.Rout’. Make sure the environment is successfully set up before running the subsequent steps.
RNASeqEnvironmentSet_CMD(exp)RNASeqEnvironmentSet(exp@path.prefix,
exp@input.path.prefix,
exp@genome.name,
exp@sample.pattern,
exp@indices.optional, exp@os.type)FASTQ Files Quality AssessmentThis is an optional step of RNA-Seq workflow in this package. However, it is strongly recommended to do it before processing reads. To assess the quality of raw reads in FASTQ files, run RNASeqQualityAssessment_CMD() to execute process in background or run RNASeqQualityAssessment() to execute process in R shell.
In this process, systemPipeR package is used for quality assessment. Here are the following steps:
FASTQ files and create ‘fastqReport.pdf’ as the report result of quality assessmentRemove ‘rnaseq/’ directory.
This quality assessment result is generated by systemPipeR package. It will be stored as
PDF.
Run RNASeqQualityAssessment_CMD() or RNASeqQualityAssessment(). 1. Run in background Result will be reported in ‘Rscriptout/Quality_Assessment.Rout’. Make sure quality assessment is successfully done before running the subsequent steps.
RNASeqQualityAssessment_CMD(exp)RNASeqQualityAssessment(exp@path.prefix,
exp@input.path.prefix,
exp@sample.pattern)## [1] "Generated rnaseq directory. Next run in rnaseq directory, the R code from *.Rmd (*.Rnw) template interactively. Alternatively, workflows can be exectued with a single command as instructed in the vignette."
FASTQ Files Quality TrimmingThis is an optional step of RNA-Seq workflow in this package. However, it is recommended to do it if the result of quality assessment is unsatisfied. To trim the raw reads in FASTQ files, run RNASeqQualityTrimming_CMD() to execute process in background or run RNASeqQualityTrimming() to execute process in R shell. After trimming process, it is strongly recommended to redo quality assessment step. Different quality assessment result will be stored in different directories with index of processing order.
In this process, ShortRead package is used for quality trimming. Followings are trimming steps and method explanation.
FASTQ files with sample.pattern in ‘gene_data/raw_fastq.gz’ will be listed and be check. This trimming function only support paired-end FASTQ files.FASTQ files and the cumulative probability of error of each base pair will also be calculated. The threshold cumulative probability cum.error, with default value 1, is used to find the trimming point of paired end files.reads.length.limit, with default value 36, determines the minimum base pair length of reads. Length of reads less than reads.length.limit will be filtered out.FASTQ files are moved to ‘gene_data/raw_fastq.gz/original_untrimmed_fastq.gz’, and trimmed reads are written to ‘gene_data/raw_fastq.gz/’RNASeqQualityAssessment_CMD() or RNASeqQualityAssessment() for quality assessment after quality trimming.Run RNASeqQualityTrimming_CMD() or RNASeqQualityTrimming(). 1. Run in background Result will be reported in ‘Rscript_out/Quality_Trimming.Rout’. Make sure quality trimming is successfully done before running the subsequent steps.
RNASeqQualityTrimming_CMD(exp)RNASeqQualityTrimming(exp@path.prefix,
exp@sample.pattern)This is a second necessary step of RNA-Seq workflow in this package. Programs installed in environment setting step, HISAT2, SAMtools, StringTie and Gffcompare, will be executed. To process raw reads FASTQ files, run RNASeqRawReadProcess_CMD() to execute process in background or run RNASeqRawReadProcess() to execute process in R shell. For more details about commands that executed during each reads processing step, please check the reported ‘RNASeq_results/COMMAND.txt’. The process steps are mainly refering to this paper(Pertea et al. 2016).
In pre-step(RNASeqWorkFlowParam creation step), ‘indices/’ directory is checked and the whether HT2 indices files are provided is known. If not provided, following commands will be executed:
genome.name.gtf’, ‘genome.name.fa’genome.name.ss’, ‘genome.name.exon’, ’genome.name_tran.{number}.ht2’extract_splice_sites.py, extract_exons.py execution
hisat2-build index creation
genome.name.ss and genome.name.exon created in the previous step. Be aware that index building step requires a larger amount of memory and longer time than steps, and it might not be possible to run on some personal workstations. It is highly recommended to check the availibility of HT2 indices files at HISAT2 official website for your target reference genome beforehands. Install HT2 indices files will greatly shorten the analysis time.genome.name_tran.{number}.ht2’, ‘sample.pattern.fastq.gz’sample.pattern.sam’hisat2 command is executed on paired end FASTQ files. SAM files will be created.
SAM files are stored in ‘gene_data/raw_bam/’.SAM to BAMsample.pattern.sam’sample.pattern.bam’samtools command is executed. Output SAM files from HISAT2 will be converted to BAM files.
BAM files are stored in ‘gene_data/raw_sam/’.genome.name.gtf’, ‘sample.pattern.bam’sample.pattern.gtf’stringtie command is executed.
GTF files which are from each FASTQ files are stored in ‘gene_data/raw_gtf/’GTFsample.pattern.gtf’stringtie command is executed.
sample.pattern.gtf into stringtiemerged.gtfgenome.name.gtf’, ‘stringtie_merged.gtf’gffcompare command is executed.
GTF file and reference annotation file is reported under ‘merged/’ directory.stringtie command is executed.
TSV file by each sample name in ‘gene_data/gene_abundance/’.Whether this step is executed depends on the availability of Python on your workstation.
prepDE.pyprepDE.py is executed.2to3 command will be checked.(Usally, if Python3 is installed, 2to3 command will be installed too.)2to3 command is unavailable, raw reads count generation step will be skipped.2to3 command is available, prepDE.py is converted to file that can be executed by Python2 and be executed.Run RNASeqReadProcess_CMD() or RNASeqReadProcess(). 1. Run in background Result will be reported in ‘Rscriptout/Raw_Read_Process.Rout’. Make sure raw read process is successfully done before running the subsequent steps.
RNASeqReadProcess_CMD(exp)python.variable <- exp@python.variable
python.variable.answer <- python.variable$check.answer
python.variable.version <- python.variable$python.version
RNASeqReadProcess(exp@path.prefix,
exp@input.path.prefix,
exp@genome.name,
exp@sample.pattern,
python.variable.answer,
python.variable.version,
exp@python.2to3,
num.parallel.threads = 10,
exp@indices.optional)## [1] "SRR3396381_XV /tmp/yeast_example/gene_data/ballgown/SRR3396381_XV/SRR3396381_XV.gtf"
This is the third necessary step of RNA-Seq workflow in this package. To do differential expression analysis, run RNASeqDifferentialAnalysis_CMD() to execute process in background or run RNASeqDifferentialAnalysis() to execute process in R shell. In RNASeqWorkflow package, we provide four ways to do differential expression analysis: ballgown, TPM & Student t-test, DESeq2 and edgeR. Gene ids reported by StringTie(raw reads count) and ballgown(result) will be mapped to ‘gene_name’ provided in GTF file. ‘gene_name’ will be used to do gene functional analysis and pathway analysis in the next step.(!!!Functional 還有一些問題~ 這週要和學姊討論一下!!!)
If you run RNASeqRawReadProcess_CMD() in the previous step, ‘Rscript_out/Read_Process.Rout’ will be created and in the beginning of this step, alignment results, aCSV file and a PNG file, will be reported based on ‘Rscript_out/Read_Process.Rout’ file.
‘ballgown’, ‘TPM & Student t-test’, ‘DESeq2’ and ‘edgeR’ will be processed in order in this step. After doing differential expression analysis, results of each method will be visualized and stored in ‘RNASeq_results/’. The pictures showed in the following steps are the visualization result of data in RNASeqWorkflowData experiment package. The detail steps and the method-specific plots in each method will be discussed sperately later. Here we focus on the some general plots that will be created and stored in the directories of these four differential analysis tools. We only show ballgown result plots as an example. The following are the general plots(ballgown):
‘Frequency_Plot_normalized_count_ggplot2.png’ The frequency of normalized reads count(In ballgown is FPKM). The range of x axis(FPKM) is from 0% of data minus 20(smallest FPKM values - 20) to 90% of data plus 20. It is drawn by ggplot2.
‘Frequency_Plot_log_normalized_count_ggplot2.png’ The frequency of log base two of normalized reads count plus one(in ballgown is log2(FPKM+1)). The range of x axis(log2(FPKM+1)) is from 0% of data minus 5(smallest log2(FPKM+1) - 5) to 99% of data plus 5. It is drawn by ggplot2.
‘Box_Plot_ggplot2.png’ The distribution of log base two of normalized reads count plus one(in ballgown is log2(FPKM+1)) depicting groups of numerical . Case is on left-hand side in turquoise and control is on right-hand side in yellow. It is drawn by ggplot2.
‘Violin_Plot_ggplot2.png’ The distribution of log base two of normalized reads count plus one(in ballgown is log2(FPKM+1)) of each sample. Case is on left-hand side in green and control is on right-hand side in yellow. It is drawn by ggplot2.
‘Dimension_PCA_Plot_factoextra.png’ Percentage of explained variances in top five dimensions are plotted. The top two percentage of explained variance will be used to plot PCA plot.
‘PCA_Plot_factoextra.png’ This PCA plot is drawn by fviz_pca_ind() function in factoextra. Case group is in green and control group is in yellow. The smaller points represent the individual samples and the bigger points represent the average of case group samples and control group samples. Ellipses will be added for case and control group samples.
‘PCA_Plot_ggplot2.png’ This PCA plot is drawn directly by ggplot2 with principal component scores calculated by FactoMineR package. Case group is in green and control group is in yellow.
‘Correlation_Heat_Plot_ggplot2.png’ Correlations between each pair of samples is calculated by using R Stats package and drawn by ggplot2. Maximum correlation(1) is in red and minimum correlation in all sample pairs is in blue.
‘Correlation_Dot_Plot_corrplot.png’ Correlation dot plot is drawn by corrplot() function in corrplot package. Maximum correlation(1) is in red and minimum correlation in all sample pairs is in blue.
‘Correlation_Bar_Plot_PerformanceAnalytics.png’ Correlation bar plot is drawn by chart.Correlation() function in PerformanceAnalytics package.
‘Dimension_PCA_Plot_factoextra.png’
‘PCA_Plot_factoextra.png’
‘PCA_Plot_ggplot2.png’
pheatmap() function in pheatmap package. Log base 2 of normalized count plus one(log2(FPKM+1) for ballgown) is calculated first. All calculated sample values substract mean of case group sample values(log2(case_mean_FPKM+1) - log2(each_FPKM+1) ) will be used to drawn this heatamp. Case group samples are on the left in green and control group samples are on the right in yellow. Ballgown is designed to simplify differential expression analysis of RNA-Seq data(more information in ballgown github). The default statistical model is a parametric F-test comparing nested linear models. Normalized row count in ballgown is represented as FPKM values. The following are ballgown analysis steps:
ballgown object is created and written in ‘RNASeqresults/ballgownanalysis/ballgownRobject/ballgown.rda’.0 will be filter out. Moreover, row of extracted statistic results from created ballgown object with pval equals 0 or qval equals 0 will be filterd out too. Log base two fold change(log2(fold change)) value will be calculated as column name log2FC. According to the ‘gene_data/phenodata.csv’ file, normalized count will be seperated into case and control group with column name sample.pattern.FPKM.CSV file combined with normalized FPKM count, average of case group, control group and statistic results will be stored as ‘RNASeq_results/ballgown_analysis/ballgown_normalized_result.csv’.pval < 0.05 and log2FC > 1 | log2FC < 1.log2FC and Y axis log2(FPKM.mean). It is drawn by ggplot2 package. Normalized count is represented as TPM values. The statistic model used here is Student’s t-test. The following are analysis steps:
sample.pattern.FPKM FPKM values.pval.fc. Log base two fold change(log2(fold change)) is also calculated and stored as log2FC.CSV file combined with normalized TPM count, average of case group, control group and statistic results will be stored as ‘RNASeq_results/TPM_analysis/TPM_normalized_result.csv’.pval < 0.05 and log2FC > 1 | log2FC < 1log2FC and Y axis log2(TPM.mean). It is drawn by ggplot2 package. DESeq2 estimate variance-mean dependence in raw reads count( more information inDESeq2 document). The statistic model for differential expression is using negative binomial distribution. DESeq2 package takes sequence depth and RNA composition into consideration and use median of ratios normalization(MRN) method to reads count normalization (related site). The follwing are the analysis steps.
o will be filtered out)DESeqDataSet object from raw reads count of genes by running DESeqDataSetFromMatrix() function in DESeq2.DESeq2() function to do differential expression analysis.CSV file combined with normalized MRN count, average of case group, control group and statistic results will be stored as ‘RNASeq_results/DESeq2_analysis/DESeq2_normalized_result.csv’.pval < 0.05 and log2FC > 1 | log2FC < 1plotDispEsts(), to plot the dispersion estimates. X axis is mean of normalized counts; Y axis is dispersion. mean of normalized counts and Y axis log fold change. It is created by plotMA() function in DESeq2. edgeR, a tool for differential expression analysis, implements a range of statistical method based on negative binomial distribution. edgeR package normalizes for RNA composition by finding a set of scaling factors for the library sizes, which is computed by trimmed mean of M-values(TMM). Normalized count is represented as cpm with library size scaled by TMM (related site). The following are analysis steps:
o will be filtered out)DEGList object from raw reads count by running DGEList() function.DEGList object with TMM by running calcNormFactors() function.estimateCommonDisp() to maximize the negative binomial conditional common likelihood and then run estimateTagwiseDisp() to estimate tagwise dispersion values by an empirical Bayes method.exactTest().cpm() on TMM scaled DEGList object.CSV file combined with normalized TMM count, average of case group, control group and statistic results will be stored as ‘RNASeq_results/edgeR_analysis/edgeR_normalized_result.csv’.pval < 0.05 and log2FC > 1 | log2FC < 1Visualization. The plots mentioned above and the following plots will be drawn.
MeanVar Plot edgeR provides a useful function plotMeanVar() to visualize the mean-variance relationship in DEGList.
BCV Plot edgeR provides a useful function plotBCV() to plot the genewise biological coefficient of variance(BCV) against gene abundace.
MDS Plot edgeR provides a useful function plotMDS.DGEList() to plot samples on a 2D scatterplot with distances representing expression differences between the samples.
Run RNASeqDifferentialAnalysis_CMD() or RNASeqDifferentialAnalysis(). 1. Run in background Result will be reported in ‘Rscriptout/Differential_Analysis.Rout’. Make sure differential expression analysis is successfully finished before running subsequent steps.
RNASeqDifferentialAnalysis_CMD(exp)RNASeqDifferentialAnalysis(exp@path.prefix,
exp@genome.name,
exp@sample.pattern,
exp@independent.variable,
exp@case.group,
exp@control.group,
ballgown.pval = 0.05,
ballgown.log2FC = 1,
TPM.pval = 0.05,
TPM.log2FC = 1,
DESeq2.pval = 0.1,
DESeq2.log2FC = 1,
edgeR.pval = 0.05,
edgeR.log2FC = 1)這邊還有點問題~週五想和學姊討論一下~ 這邊先不用看!~~ This is the third optional step of RNA-Seq workflow in this package. clusterProfiler is used in this step. Gene Ontology(GO) functional analysis and Kyoto Encyclopedia of Genes and Genomes(KEGG) pathway analysis are made based on the differential expressed genes(DEG) that are found in four different differential analyses, ballgown, TPM&t-test, DESeq2 and edgeR. Run RNASeqGoKegg_CMD() to execute process in background or run RNASeqGoKegg() to execute process in R shell.
The gene names used in StringTie and ballgown are in the type SYMBOL in clusterProfiler. In GO functional analysis and KEGG pathway analysis, SYMBOL ID type will be converted into ENTREZID ID type by bitr() function in clusterProfiler first. Those SYMBOL with no corresponding ENTREZID will return NA and be filtered out. The genes with Inf or -Inf log2 fold change will be filtered out too. ID conversion will be done in each differential analysis tools, ballgown, TPM & Student’s t-test, DESeq2 and edgeR.
Gene Ontology is the framework for the model of biology. It defines the universe of concepts relating to gene functions(GO terms) along three aspects: molecular function(MF), cellular component(CC), biological process(BP), and how these functions are related to each other(relation) (GO website). In this process, GO gene set enrichment analysis, GO classification and GO over-representation test are implemented by clusterProfiler package.
GO GSEA is implemented by running gseGO() function in clusterProfiler. As mentioned in the vignette of clusterProfiler, this approach aggregates the statistic of each gene within gene set, thus make it possible to detect the situation that genes where the difference is small, but evidenced in coordinated way in a set of related genes.
In this step, the input gene set is all genes reported in each differential analysis tools(not just for selected differential expressed genes). If enriched genes are found, the top 5 genes with smallest p-value will be plotted with the function gseaplot() in clusterProfiler. If not, ‘GO_GSE_NO_TERM’ empty file will be created.
The input gene set is differential genes that are reported in each differential analysis tools. groupGO() function in clusterProfiler is used. Classification result will be stored in CSV. The result of classification will be visualized. 12 highest GeneRatio functional description will be selected and plotted as bar plot.
The input gene set is differential genes that are reported in each differential analysis tools. enrichGO() function in clusterProfiler is used. GO over-representation result will be stored in CSV. The result of GO over-representation will be visualized as bar plot and dot plot.
Kyoto Encyclopedia of Genes and Genomes(KEGG) is a database resource for understanding functions and utilities of the biological system from molecular-level information (KEGG website). In this process, KEGG gene set enrichment analysis, KEGG over-representation test are implemented by clusterProfiler package.
KEGG GSEA is implemented by running gseKEGG() function in clusterProfiler. In this step, the input gene set is all genes reported in each differential analysis tools(not just for selected differential expressed genes). If enriched genes are found, the top 5 genes with smallest p-value will be plotted with the function gseaplot() in clusterProfiler. If not, ‘KEGG_GSE_NO_TERM’ empty file will be created.
The input gene set is differential genes that are reported in each differential analysis tools. enrichKEGG() function in clusterProfiler is used. KEGG over-representation result will be stored in CSV. The result of KEGG over-representation will be visualized as bar plot and dot plot.
RNASeqGoKegg_CMD(exp,
OrgDb.species = "org.Sc.sgd.db",
go.level = 3,
input.TYPE.ID = "GENENAME",
KEGG.organism = "sce")RNASeqGoKegg(exp@path.prefix,
exp@independent.variable,
OrgDb.species = "org.Sc.sgd.db",
go.level = 3,
input.TYPE.ID = "GENENAME",
KEGG.organism = "sce")toLatex(sessionInfo())## \begin{itemize}\raggedright
## \item R version 3.5.0 (2018-04-23), \verb|x86_64-pc-linux-gnu|
## \item Locale: \verb|LC_CTYPE=en_US.UTF-8|, \verb|LC_NUMERIC=C|, \verb|LC_TIME=en_US.UTF-8|, \verb|LC_COLLATE=C|, \verb|LC_MONETARY=en_US.UTF-8|, \verb|LC_MESSAGES=en_US.UTF-8|, \verb|LC_PAPER=en_US.UTF-8|, \verb|LC_NAME=C|, \verb|LC_ADDRESS=C|, \verb|LC_TELEPHONE=C|, \verb|LC_MEASUREMENT=en_US.UTF-8|, \verb|LC_IDENTIFICATION=C|
## \item Running under: \verb|Ubuntu 17.10|
## \item Matrix products: default
## \item BLAS: \verb|/usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1|
## \item LAPACK: \verb|/home/kuan-hao/anaconda3/lib/libmkl_rt.so|
## \item Base packages: base, datasets, grDevices, graphics, methods,
## parallel, stats, stats4, utils
## \item Other packages: AnnotationDbi~1.43.1, Biobase~2.41.2,
## BiocGenerics~0.27.1, BiocStyle~2.9.6, IRanges~2.15.16,
## RNASeqWorkflow~0.99.0, RNASeqWorkflowData~0.1.0,
## S4Vectors~0.19.19, edgeR~3.23.3, ggplot2~3.0.0, limma~3.37.3,
## org.Hs.eg.db~3.6.0, pathview~1.21.0
## \item Loaded via a namespace (and not attached):
## AnnotationFilter~1.5.2, AnnotationForge~1.23.3, BBmisc~1.11,
## BSgenome~1.49.3, BatchJobs~1.7, BiocManager~1.30.1,
## BiocParallel~1.15.8, Biostrings~2.49.1, Category~2.47.1,
## DBI~1.0.0, DESeq2~1.21.16, DO.db~2.9, DOSE~3.7.1,
## DelayedArray~0.7.30, FactoMineR~1.41, Formula~1.2-3,
## GO.db~3.6.0, GOSemSim~2.7.1, GOstats~2.47.0, GSEABase~1.43.1,
## GenomeInfoDb~1.17.1, GenomeInfoDbData~1.1.0,
## GenomicAlignments~1.17.3, GenomicFeatures~1.33.2,
## GenomicFiles~1.17.2, GenomicRanges~1.33.13, Hmisc~4.1-1,
## KEGGREST~1.21.1, KEGGgraph~1.41.0, MASS~7.3-50, Matrix~1.2-14,
## PerformanceAnalytics~1.5.2, ProtGenerics~1.13.0, QuasR~1.21.2,
## R6~2.2.2, RBGL~1.57.0, RColorBrewer~1.1-2, RCurl~1.95-4.11,
## RSQLite~2.1.1, Rbowtie~1.21.2, Rcpp~0.12.18, Rgraphviz~2.25.0,
## Rqc~1.15.0, Rsamtools~1.33.3, ShortRead~1.39.0,
## SummarizedExperiment~1.11.6, UpSetR~1.3.3,
## VariantAnnotation~1.27.4, XML~3.98-1.16, XVector~0.21.3,
## acepack~1.4.1, annotate~1.59.1, assertthat~0.2.0,
## backports~1.1.2, ballgown~2.13.0, base64enc~0.1-3, bindr~0.1.1,
## bindrcpp~0.2.2, biomaRt~2.37.5, biovizBase~1.29.1, bit~1.1-14,
## bit64~0.9-7, bitops~1.0-6, blob~1.1.1, bookdown~0.7, brew~1.0-6,
## checkmate~1.8.5, cluster~2.0.7-1, clusterProfiler~3.9.2,
## colorspace~1.3-2, commonmark~1.5, compiler~3.5.0, corrplot~0.85,
## cowplot~0.9.3, crayon~1.3.4, curl~3.2, data.table~1.11.4,
## desc~1.2.0, devtools~1.13.6, dichromat~2.0-0, digest~0.6.15,
## doBy~4.6-1, dplyr~0.7.6, enrichplot~1.1.5, ensembldb~2.5.5,
## europepmc~0.3, evaluate~0.11, factoextra~1.0.5, fastmatch~1.1-0,
## fgsea~1.7.1, flashClust~1.01-2, foreign~0.8-71,
## genefilter~1.63.0, geneplotter~1.59.0, ggforce~0.1.3,
## ggplotify~0.0.3, ggpubr~0.1.7, ggraph~1.0.2, ggrepel~0.8.0,
## ggridges~0.5.0, glue~1.3.0, graph~1.59.0, grid~3.5.0,
## gridExtra~2.3, gridGraphics~0.3-0, gtable~0.2.0, hms~0.4.2,
## htmlTable~1.12, htmltools~0.3.6, htmlwidgets~1.2, httpuv~1.4.5,
## httr~1.3.1, hwriter~1.3.2, igraph~1.2.2, jsonlite~1.5,
## knitr~1.20, labeling~0.3, later~0.7.3, lattice~0.20-35,
## latticeExtra~0.6-28, lazyeval~0.2.1, leaps~3.0, locfit~1.5-9.1,
## magrittr~1.5, markdown~0.8, matrixStats~0.54.0, memoise~1.1.0,
## mgcv~1.8-24, mime~0.5, munsell~0.5.0, nlme~3.1-137, nnet~7.3-12,
## org.Mm.eg.db~3.6.0, org.Rn.eg.db~3.6.0, pheatmap~1.0.10,
## pillar~1.3.0, pkgconfig~2.0.2, plyr~1.8.4, png~0.1-7,
## prettyunits~1.0.2, progress~1.2.0, promises~1.0.1, purrr~0.2.5,
## quadprog~1.5-5, qvalue~2.13.0, rafalib~1.0.0, refGenome~1.7.3,
## reshape2~1.4.3, reticulate~1.10, rjson~0.2.20, rlang~0.2.2,
## rmarkdown~1.10, roxygen2~6.1.0, rpart~4.1-13, rprojroot~1.3-2,
## rstudioapi~0.7, rtracklayer~1.41.4, rvcheck~0.1.0, scales~1.0.0,
## scatterplot3d~0.3-41, sendmailR~1.2-1, shiny~1.1.0,
## splines~3.5.0, stringi~1.2.4, stringr~1.3.1, survival~2.42-6,
## sva~3.29.0, systemPipeR~1.15.1, systemPipeRdata~1.9.1,
## tibble~1.4.2, tidyr~0.8.1, tidyselect~0.2.4, tools~3.5.0,
## triebeard~0.3.0, tweenr~0.1.5, units~0.6-0, urltools~1.7.1,
## viridis~0.5.1, viridisLite~0.3.0, withr~2.1.2, xfun~0.3,
## xml2~1.2.0, xtable~1.8-2, xts~0.11-0, yaml~2.2.0,
## zlibbioc~1.27.0, zoo~1.8-3
## \end{itemize}
https://www.rna-seqblog.com/introduction-to-rna-sequencing-and-analysis/
https://en.wikipedia.org/wiki/RNA-Seq
Pang CN et al., “Transcriptome and network analyses in Saccharomyces cerevisiae reveal that amphotericin B and lactoferrin synergy disrupt metal homeostasis and stress response.”, Sci Rep, 2017 Jan 12;7:40232
Kim D, Langmead B and Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nature Methods 2015
Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT & Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads Nature Biotechnology 2015, doi:10.1038/nbt.3122
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621.
Backman TWH, Girke T (2016). “systemPipeR: NGS workflow and report generation environment.” BMC Bioinformatics, 17(1). doi: 10.1186/s12859-016-1241-0, https://doi.org/10.1186/s12859-016-1241-0.
Morgan M, Anders S, Lawrence M, Aboyoun P, Pagès H, Gentleman R (2009). “ShortRead: a Bioconductor package for input, quality assessment and exploration of high-throughput sequence data.” Bioinformatics, 25, 2607-2608. doi: 10.1093/bioinformatics/btp450, http://dx.doi.org10.1093/bioinformatics/btp450.
Fu J, Frazee AC, Collado-Torres L, Jaffe AE, Leek JT (2018). ballgown: Flexible, isoform-level differential expression analysis. R package version 2.12.0.
Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi: 10.1186/s13059-014-0550-8.
Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140.
McCarthy, J. D, Chen, Yunshun, Smyth, K. G (2012). “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Research, 40(10), 4288-4297.
Yu G, Wang L, Han Y, He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), 284-287. doi: 10.1089/omi.2011.0118.
Pertea, M., Kim, D., Pertea, G. M., Leek, J. T., & Salzberg, S. L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature protocols, 11(9), 1650. ISO 690
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.